Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science
As the field of data science continues to grow, there will be an
ever-increasing demand for tools that make machine learning accessible to
non-experts. In this paper, we introduce the concept of tree-based pipeline
optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement an open source Tree-based Pipeline
Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a
series of simulated and real-world benchmark data sets. In particular, we show
that TPOT can design machine learning pipelines that provide a significant
improvement over a basic machine learning analysis while requiring little to no
input or prior knowledge from the user. We also address the tendency for TPOT
to design overly complex pipelines by integrating Pareto optimization, which
produces compact pipelines without sacrificing classification accuracy. As
such, this work represents an important step toward fully automating machine
learning pipeline design.
Comment: 8 pages, 5 figures; preprint to appear in GECCO 2016; edits from reviewer comments not yet incorporated.
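For concreteness, here is a minimal usage sketch of TPOT's scikit-learn-style interface as described above. The dataset, search budget and settings are illustrative choices, not the paper's benchmarks, and constructor arguments may vary between TPOT releases.

```python
# Minimal TPOT sketch: evolve a pipeline on an illustrative dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Genetic programming over pipeline trees; Pareto optimization favours
# compact pipelines without sacrificing accuracy.
tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # emit the winning pipeline as scikit-learn code
```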
Variable Selection and Model Averaging in Semiparametric Overdispersed Generalized Linear Models
We express the mean and variance terms in a double exponential regression
model as additive functions of the predictors and use Bayesian variable
selection to determine which predictors enter the model, and whether they enter
linearly or flexibly. When the variance term is null, we obtain a generalized
additive model, which becomes a generalized linear model if the predictors
enter the mean linearly. The model is estimated using Markov chain Monte Carlo
simulation and the methodology is illustrated using real and simulated data
sets.
Comment: 8 graphs, 35 pages.
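A minimal generative sketch of this model class follows: the mean and the overdispersion are additive functions of the predictors, with one predictor entering linearly and another flexibly. The paper works with the double exponential family; the negative binomial below is only a convenient overdispersed stand-in, and every functional form here is invented for illustration.

```python
# Simulate counts whose mean and dispersion are additive in the predictors.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)

log_mean = 1.0 + 0.8 * x1 + np.sin(np.pi * x2)   # x1 enters linearly, x2 flexibly
log_disp = 0.5 + 0.6 * x1                        # dispersion also varies with x1
mu, k = np.exp(log_mean), np.exp(log_disp)

# NumPy's negative_binomial(n, p) has mean n*(1-p)/p; solve for p given mu and k.
p = k / (k + mu)
y = rng.negative_binomial(k, p)
print(y.mean(), y.var())  # variance exceeds the mean: overdispersion
```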
Machine Learning for Quantum Mechanical Properties of Atoms in Molecules
We introduce machine learning models of quantum mechanical observables of
atoms in molecules. Instant out-of-sample predictions for proton and carbon
nuclear chemical shifts, atomic core level excitations, and forces on atoms
reach accuracies on par with the density functional theory reference. Locality is
exploited within non-linear regression via local atom-centered coordinate
systems. The approach is validated on a diverse set of 9k small organic
molecules. Linear scaling of computational cost in system size is demonstrated
for saturated polymers with lengths up to the sub-mesoscale.
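As a generic stand-in for the approach above, the sketch below fits kernel ridge regression to synthetic "atom-centered" feature vectors. The descriptor, kernel choice and hyperparameters are assumptions for illustration only, not the authors' model or their 9k-molecule benchmark.

```python
# Kernel ridge regression on synthetic atom-centered features.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30))   # one row per (synthetic) atomic environment
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = KernelRidge(kernel="laplacian", alpha=1e-3, gamma=0.05)
model.fit(X_tr, y_tr)
print("test MAE:", np.abs(model.predict(X_te) - y_te).mean())
```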
A Bayesian spatio-temporal model of panel design data: airborne particle number concentration in Brisbane, Australia
This paper outlines a methodology for semi-parametric spatio-temporal
modelling of data which are dense in time but sparse in space, obtained from a
split panel design, the most feasible approach to covering space and time with
limited equipment. The data are hourly averaged particle number concentration
(PNC) and were collected as part of the Ultrafine Particles from Transport
Emissions and Child Health (UPTECH) project. Two weeks of continuous
measurements were taken at each of a number of government primary schools in
the Brisbane Metropolitan Area. The monitoring equipment was taken to each
school sequentially. The school data are augmented by data from long term
monitoring stations at three locations in Brisbane, Australia.
Fitting the model helps describe the spatial and temporal variability at a
subset of the UPTECH schools and the long-term monitoring sites. The temporal
variation is modelled hierarchically with penalised random walk terms, one
common to all sites and a term accounting for the remaining temporal trend at
each site. Parameter estimates and their uncertainty are computed in a
computationally efficient approximate Bayesian inference environment, R-INLA.
The temporal part of the model explains daily and weekly cycles in PNC at the
schools, which can be used to estimate the exposure of school children to
ultrafine particles (UFPs) emitted by vehicles. At each school and long-term
monitoring site, peaks in PNC can be attributed to the morning and afternoon
rush hour traffic and new particle formation events. The spatial component of
the model describes the variation in mean PNC between schools and within each
school ground. It is shown how the spatial model can be
expanded to identify spatial patterns at the city scale with the inclusion of
more spatial locations.
Comment: Draft of this paper was presented as a poster at ISBA 2012; part of the UPTECH project.
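As a toy illustration of the penalised random walk component, the sketch below smooths synthetic hourly data with a first-order random walk (RW1) penalty, the building block of the hierarchical temporal terms described above. The full model is fitted with R-INLA in the paper; this NumPy fragment shows only the RW1 idea, and the data and smoothing parameter are invented.

```python
# RW1 smoothing: the trend is the posterior mean under a Gaussian
# first-order random-walk prior, i.e. a ridge fit with a difference penalty.
import numpy as np

rng = np.random.default_rng(0)
T = 24 * 14                                   # two weeks of hourly observations
t = np.arange(T)
y = np.sin(2 * np.pi * t / 24) + 0.3 * rng.normal(size=T)  # daily cycle + noise

D = np.diff(np.eye(T), axis=0)                # first-difference matrix, (T-1, T)
lam = 50.0                                    # noise-to-trend precision ratio
trend = np.linalg.solve(np.eye(T) + lam * D.T @ D, y)
print(trend[:5])
```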
High-Dimensional Inference with the generalized Hopfield Model: Principal Component Analysis and Corrections
We consider the problem of inferring the interactions between a set of N
binary variables from the knowledge of their frequencies and pairwise
correlations. The inference framework is based on the Hopfield model, a special
case of the Ising model where the interaction matrix is defined through a set
of patterns in the variable space, and is of rank much smaller than N. We show
that Maximum Likelihood inference is deeply related to Principal Component
Analysis when the amplitude of the pattern components, xi, is negligible
compared to N^1/2. Using techniques from statistical mechanics, we calculate
the corrections to the patterns to the first order in xi/N^1/2. We stress that
it is important to generalize the Hopfield model and include both attractive
and repulsive patterns, to correctly infer networks with sparse and strong
interactions. We present a simple geometrical criterion to decide how many
attractive and repulsive patterns should be considered as a function of the
sampling noise. We moreover discuss how many sampled configurations are
required for a good inference, as a function of the system size, N, and of the
amplitude, xi. The inference approach is illustrated on synthetic and
biological data.
Comment: Physical Review E: Statistical, Nonlinear, and Soft Matter Physics (2011), to appear.
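A rough NumPy sketch of the PCA connection: eigenvalues of the empirical correlation matrix that escape the sampling-noise bulk yield attractive (large-eigenvalue) and repulsive (small-eigenvalue) patterns. The planted-pattern data, the Marchenko-Pastur cut and the unit pattern weights are illustrative simplifications; the paper's maximum-likelihood amplitudes and first-order xi/N^1/2 corrections are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N = 5000, 50
xi = rng.choice([-1.0, 1.0], size=N)          # one planted pattern
z = rng.normal(size=(B, 1))
samples = np.where(0.4 * z * xi + rng.normal(size=(B, N)) > 0, 1, -1)

C = np.corrcoef(samples, rowvar=False)        # pairwise Pearson correlations
w, v = np.linalg.eigh(C)

# A pure-noise spectrum is confined near the Marchenko-Pastur bulk.
q = N / B
lo, hi = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
keep = (w < lo) | (w > hi)                    # repulsive | attractive modes

# Rank-one contributions with +/- signs; ML amplitudes replaced by unit
# weights purely for illustration.
signs = np.sign(w[keep] - 1.0)
J = (v[:, keep] * signs) @ v[:, keep].T
np.fill_diagonal(J, 0.0)
print("patterns kept:", int(keep.sum()))
```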
Varying-coefficient modeling via regularized basis functions
We address the problem of constructing varying-coefficient models based on
basis expansions along with the technique of regularization. A crucial point in
our modeling procedure is the selection of smoothing parameters in the
regularization method. In order to choose the parameters objectively, we derive
model selection criteria from information-theoretic and Bayesian viewpoints.
We demonstrate the effectiveness of the proposed modeling strategy through
Monte Carlo simulations and the analysis of a real data set.
Comment: 10 pages, 4 figures.
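A minimal sketch of the basic construction: the varying coefficient beta(t) is expanded in a basis and estimated by penalised (ridge) least squares. A plain polynomial basis and a hand-picked penalty stand in for the paper's basis functions and its information-theoretic/Bayesian selection of the smoothing parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 300, 8
t = rng.uniform(0, 1, n)                      # effect modifier
x = rng.normal(size=n)                        # predictor with a varying coefficient
beta_true = np.sin(2 * np.pi * t)
y = beta_true * x + 0.2 * rng.normal(size=n)

B = np.vander(t, K, increasing=True)          # polynomial basis in t, shape (n, K)
Z = B * x[:, None]                            # model: y = (B @ c) * x + noise
lam = 1e-4                                    # ridge penalty on basis coefficients
c = np.linalg.solve(Z.T @ Z + lam * np.eye(K), Z.T @ y)
print("max |beta_hat - beta_true|:", np.abs(B @ c - beta_true).max())
```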
Fast stable direct fitting and smoothness selection for Generalized Additive Models
Existing computationally efficient methods for penalized likelihood GAM
fitting employ iterative smoothness selection on working linear models (or
working mixed models). Such schemes fail to converge for a non-negligible
proportion of models, with failure being particularly frequent in the presence
of concurvity. If smoothness selection is performed by optimizing 'whole model'
criteria these problems disappear, but until now attempts to do this have
employed finite difference based optimization schemes which are computationally
inefficient, and can suffer from false convergence. This paper develops the
first computationally efficient method for direct GAM smoothness selection. It
is highly stable, but by careful structuring achieves a computational
efficiency that leads, in simulations, to lower mean computation times than the
schemes based on working-model smoothness selection. The method also offers a
reliable way of fitting generalized additive mixed models.
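The 'whole model' idea can be sketched with a toy single-smooth example: fit a penalised smoother for each candidate lambda and minimise the GCV score of the complete fit. The paper optimises such criteria directly, stably and efficiently for multiple smooths; the grid search below only illustrates the criterion being optimised, on invented data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

D = np.diff(np.eye(n), n=2, axis=0)           # second-difference roughness penalty
best = None
for lam in 10.0 ** np.arange(-4, 5):
    A = np.linalg.solve(np.eye(n) + lam * D.T @ D, np.eye(n))  # influence matrix
    resid = y - A @ y
    gcv = n * (resid @ resid) / (n - np.trace(A)) ** 2         # whole-model GCV
    if best is None or gcv < best[0]:
        best = (gcv, lam)
print("GCV-optimal lambda:", best[1])
```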
How to best threshold and validate stacked species assemblages? Community optimisation might hold the answer
1. The popularity of species distribution models (SDMs) and the associated stacked species distribution models (S-SDMs) as tools for community ecologists has increased greatly in recent years. However, while some consensus has been reached about the best methods to threshold and evaluate individual SDMs, little agreement exists on how to best assemble individual SDMs into communities, i.e. how to build and assess S-SDM predictions.
2. Here, we used published data on insects and plants collected within the same study region to test (1) whether the most established thresholding methods for optimizing single-species predictions are also the best choice for predicting species assemblage composition, or whether community-based thresholding is a better alternative, and (2) whether the optimal thresholding method depends on taxa, prevalence distribution and/or species richness. Based on a comparison of different evaluation approaches, we provide guidelines for a robust community cross-validation framework to use when spatially or temporally independent data are unavailable.
3. Our results showed that the selection of the “optimal” assembly strategy depends mostly on the evaluation approach rather than on taxa, prevalence distribution, regional species pool or species richness. When evaluated with independent data or reliable cross-validation, community-based thresholding appears superior to single-species optimisation. However, many published studies did not evaluate community projections with independent data, often leading to overoptimistic community evaluation metrics based on single-species optimisation.
4. The fact that most of the reviewed S-SDM studies reported over-fitted community evaluation metrics highlights the importance of developing clear evaluation guidelines for community models. Here, we take a first step in this direction by providing a framework for cross-validation at the community level.
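A toy contrast between the two thresholding families discussed above: per-species thresholds that maximise TSS versus a single community-based threshold chosen so that predicted species richness matches observed richness. All probabilities and occurrences below are synthetic, and the evaluation is deliberately simplistic compared with the cross-validation framework the paper proposes.

```python
import numpy as np

rng = np.random.default_rng(0)
sites, species = 200, 30
p = rng.beta(1, 3, size=(sites, species))     # predicted occurrence probabilities
obs = rng.random((sites, species)) < p        # pretend the model is well calibrated

def tss(pred, true):
    tp = (pred & true).sum()
    fn = (~pred & true).sum()
    tn = (~pred & ~true).sum()
    fp = (pred & ~true).sum()
    return tp / (tp + fn) - fp / (fp + tn)

grid = np.linspace(0.01, 0.99, 99)

# Species-based: one TSS-maximising threshold per species.
sp_thr = np.array([grid[np.argmax([tss(p[:, j] >= g, obs[:, j]) for g in grid])]
                   for j in range(species)])

# Community-based: one threshold minimising mean site-richness error.
rich_err = [np.abs((p >= g).sum(axis=1) - obs.sum(axis=1)).mean() for g in grid]
co_thr = grid[np.argmin(rich_err)]

print("median species threshold:", np.median(sp_thr),
      "community threshold:", co_thr)
```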